Python extractor: overlay support #20206

d10c · 2025-08-11T14:37:39Z

This PR adds overlay support to the Python extractor, but no overlay compilation (to be merged separately since it needs further testing, see this PR).

This PR also includes an initial pass at the discard predicates (see Overlay.qll), though these are ignored in full (non-overlay) evaluation; they probably still need to be tweaked, so I'm happy to move this commit to another PR and let this one be only about the extractor.

Roadmap:

Update the dbscheme
Implement path transformer support
Read the overlay-changes JSON file
Read/write base metadata (CODEQL_EXTRACTOR_<LANG>_OVERLAY_BASE_METADATA_{IN,OUT})
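For illustration, the `{IN,OUT}` pattern in the last roadmap item expands to two environment variable names per language. The sketch below only shows that expansion and lookup; the helper names are hypothetical and are not taken from the extractor, and the contents of the metadata files are not covered.

```python
import os


def overlay_metadata_env_vars(lang):
    """Expand CODEQL_EXTRACTOR_<LANG>_OVERLAY_BASE_METADATA_{IN,OUT} for one language.

    Hypothetical helper, for illustration only.
    """
    prefix = "CODEQL_EXTRACTOR_{}_OVERLAY_BASE_METADATA_".format(lang.upper())
    return prefix + "IN", prefix + "OUT"


def overlay_metadata_paths(lang, environ=os.environ):
    """Return (input_path, output_path); each is None when its variable is unset."""
    var_in, var_out = overlay_metadata_env_vars(lang)
    return environ.get(var_in), environ.get(var_out)
```

For Python this yields `CODEQL_EXTRACTOR_PYTHON_OVERLAY_BASE_METADATA_IN` and `CODEQL_EXTRACTOR_PYTHON_OVERLAY_BASE_METADATA_OUT`.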

python/ql/lib/semmle/python/Overlay.qll

d10c · 2025-08-28T10:33:43Z

@tausbn I'm thinking this might be a good time to checkpoint this work and get it reviewed. In the last DCA run for full analysis on this PR (see above), overall analysis time is unaffected, though there are a few outstanding stage timing results that are probably noise.

Copilot

Pull Request Overview

This PR adds overlay support to the Python extractor by implementing infrastructure for incremental analysis through database overlays, without including overlay compilation functionality.

Key changes implemented:

Database schema updates to support overlay metadata and change tracking
Extractor modifications to handle overlay-specific file traversal and metadata management
Path transformer support using updated environment variables

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
python/ql/lib/semmlecode.python.dbscheme	Adds `databaseMetadata` and `overlayChangedFiles` relations for overlay support
python/ql/lib/semmle/python/Overlay.qll	Implements discard predicates to filter out obsolete entities during overlay analysis
python/extractor/semmle/traverser.py	Modifies file traversal to only process changed files during overlay extraction
python/extractor/semmle/worker.py	Adds support for writing base metadata output required for overlay operations
python/extractor/semmle/path_rename.py	Updates path transformer to support new `CODEQL_PATH_TRANSFORMER` environment variable

Copilot · 2025-08-28T10:34:10Z

python/extractor/semmle/traverser.py

+            with open(os.environ['CODEQL_EXTRACTOR_PYTHON_OVERLAY_CHANGES'], 'r', encoding='utf-8') as f:
+                data = json.load(f)
+                changed_paths = data.get('changes', [])
+                self.overlay_changes = { os.path.abspath(p) for p in changed_paths }


The variable name self.overlay_changes is inconsistent with the other instance variables which use snake_case (self.exclude_paths, self.recurse_files, etc.). Consider renaming to self.overlay_changed_paths for consistency.
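As a rough sketch of how a set of absolute changed paths like the one built in the snippet above might restrict traversal: the function name `select_files` and the convention that `None` means "full extraction" are assumptions made for this example, not the traverser's actual code.

```python
import os


def select_files(candidates, overlay_changes=None):
    """Return the files that should be extracted.

    overlay_changes is None for a full extraction, or a set of absolute
    paths in overlay mode (assumed convention for this sketch).
    """
    if overlay_changes is None:
        return list(candidates)
    # Compare absolute paths, since the changed set was built with os.path.abspath.
    return [p for p in candidates if os.path.abspath(p) in overlay_changes]
```

Note that an empty set and `None` behave differently here: an empty set extracts nothing, while `None` extracts everything.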

tausbn

Overall I think this looks good. 👍

Do we have any tests for this? I feel like we might want to have a few CLI Integration tests to check that the overlay JSON files are being applied correctly. (The integration tests live here: https://github.com/github/codeql/tree/main/python/extractor/cli-integration-test)

Also, don't forget to update the extractor version here: https://github.com/github/codeql/blob/main/python/extractor/semmle/util.py#L13
(In this case, I think bumping it to 7.1.4 would be fine. We don't really have fixed rules for how to increase the version. The most important thing is that it changes so that we can tell from the log output what version of the extractor we're running.)

tausbn · 2025-09-05T12:50:02Z

python/extractor/semmle/traverser.py

+        if 'CODEQL_EXTRACTOR_PYTHON_OVERLAY_CHANGES' in os.environ:
+            with open(os.environ['CODEQL_EXTRACTOR_PYTHON_OVERLAY_CHANGES'], 'r', encoding='utf-8') as f:
+                data = json.load(f)


I'm debating whether we should have some exception handling here (substituting the empty list of changed files in case something goes wrong). Currently, if something ends up being messed up in the JSON, then I believe the whole extraction will just fail.

I don't have strong feelings about it, though.

Thanks for the review! I also don't have strong opinions about whether file reading should fail loudly or warn and continue with a default (None, i.e. full extraction). I guess I'll go for the latter. And also insert a logger statement with the value of the environment variable, as is the convention elsewhere in the extractor.
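A minimal sketch of the "warn and continue" option described above, assuming the JSON layout from the diff (a top-level object with a `changes` list). The function name, the logger, and the exact set of caught exceptions are assumptions for illustration and may differ from what was committed.

```python
import json
import logging
import os

logger = logging.getLogger(__name__)

ENV_VAR = "CODEQL_EXTRACTOR_PYTHON_OVERLAY_CHANGES"


def load_overlay_changes(environ=os.environ):
    """Return the set of changed absolute paths, or None for full extraction."""
    path = environ.get(ENV_VAR)
    if path is None:
        return None  # Not in overlay mode.
    # Log the value of the environment variable before using it.
    logger.info("%s=%s", ENV_VAR, path)
    try:
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
        changed_paths = data.get("changes", [])
        return {os.path.abspath(p) for p in changed_paths}
    except (OSError, ValueError, AttributeError, TypeError) as e:
        # json.JSONDecodeError is a ValueError; AttributeError and TypeError
        # cover JSON that parses but does not have the expected shape.
        logger.warning("Could not read overlay changes from %s (%s); "
                       "falling back to full extraction", path, e)
        return None
```

The trade-off is the one raised in the thread: a malformed file no longer aborts extraction, at the cost of silently doing more work unless the warning is noticed.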

d10c · 2025-09-05T16:11:47Z

Do we have any tests for this? I feel like we might want to have a few CLI Integration tests to check that the overlay JSON files are being applied correctly. (The integration tests live here: https://github.com/github/codeql/tree/main/python/extractor/cli-integration-test)

There are basic integration tests here but they depend on overlay compilation (not part of this commit), and also I'm still running into some issues on Windows (it appears that the path transformer is not working correctly there—currently debugging that). So maybe merging this should wait until I have that sorted.

Otherwise, do you have an idea for an integration test for this functionality that doesn't also exercise complete overlay evaluation?

…t-in files On Windows, we're getting e.g. the following mismatches, which could be due to case differences: "Skipped built-in file C:\hostedtoolcache\windows\Python\3.13.7\x64\Lib\multiprocessing\forkserver.py" vs "Extracted file C:\hostedtoolcache\windows\Python\3.13.7\x64\lib\asyncio\streams.py"
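If the mismatch is indeed caused by case differences (`Lib` vs `lib`), one way to compare Windows paths case-insensitively is sketched below. This is an assumption about the cause and not the fix from the commit; `is_under_windows_dir` is a hypothetical helper. It uses `ntpath` so the comparison behaves the same on any host OS.

```python
import ntpath


def is_under_windows_dir(path, directory):
    """Case-insensitive check that a Windows path lies inside a directory."""
    # normcase lowercases and converts forward slashes to backslashes;
    # normpath removes redundant separators and trailing slashes.
    norm_path = ntpath.normcase(ntpath.normpath(path))
    norm_dir = ntpath.normcase(ntpath.normpath(directory))
    # Append a separator so that "lib" does not match "library".
    prefix = norm_dir.rstrip("\\") + "\\"
    return norm_path == norm_dir or norm_path.startswith(prefix)
```

With the two paths from the log message, both files would be classified as living under the same standard library directory.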

d10c · 2025-09-10T18:47:51Z

I think I've figured out why path transformers weren't working on Windows and why built-in modules were being extracted (see latest commits). Now the integration test on the other PR passes.

The only remaining thing now is solving some tuple count regressions uncovered through DCA, but that can be done independently of this PR.

github-actions bot added the Python label Aug 11, 2025

d10c force-pushed the d10c/python-overlay branch 2 times, most recently from b18b9ce to 3015c12 Compare August 12, 2025 10:48

github-advanced-security bot found potential problems Aug 18, 2025

python/ql/lib/semmle/python/Overlay.qll (fixed)

python/ql/lib/semmle/python/Overlay.qll (fixed)

d10c force-pushed the d10c/python-overlay branch from b0c7a52 to b5c8338 Compare August 19, 2025 18:20

github-advanced-security bot found potential problems Aug 19, 2025

python/ql/lib/semmle/python/Overlay.qll (fixed)

d10c force-pushed the d10c/python-overlay branch from f75a392 to 63106c0 Compare August 20, 2025 14:32

d10c mentioned this pull request Aug 27, 2025

Python overlay compilation #20293

Closed

d10c force-pushed the d10c/python-overlay branch from 63106c0 to b3a1ba5 Compare August 27, 2025 08:42

d10c mentioned this pull request Aug 27, 2025

Python: overlay compilation d10c/codeql#1

Draft

d10c force-pushed the d10c/python-overlay branch from b3a1ba5 to fb23977 Compare August 27, 2025 08:59

d10c marked this pull request as ready for review August 28, 2025 10:33

d10c requested a review from a team as a code owner August 28, 2025 10:33

d10c requested review from Copilot and tausbn August 28, 2025 10:33

Copilot AI reviewed Aug 28, 2025

d10c mentioned this pull request Sep 1, 2025

Python: enable overlay compilation + extractor overlay support #20337

Draft

tausbn requested changes Sep 5, 2025

d10c added 10 commits September 10, 2025 20:38

Add overlay builtins to python dbscheme

04e44aa

Turn on overlay support in codeql-extractor.yml

6ce8d29

Add database upgrade/downgrade scripts

cfe314a

Support CODEQL_PATH_TRANSFORMER env var in python path renamer

1fd1535

The new name is required by overlay support.

Python extractor: in overlay mode, traverse only changed files

d4dc445

Write overlay metadata at end of extraction.

9d7de67

Discard predicates for dbscheme elements

b8fcb63

with direct or indirect location links in dbscheme.

Add synthetic data to dbscheme.stats for databaseMetadata/`overlayChangedFiles`

b07a274

Overlay.qll: remove overlay[local?] module; in favour of explicit annotations

5328643

Extractor: fall back to full extraction on overlay changes json read error

2a23559

d10c added 2 commits September 10, 2025 20:40

Path transformer: handle Windows-style paths

6692653

And don't add slash to start of path patterns on Windows.

d10c force-pushed the d10c/python-overlay branch from fb23977 to f309dc6 Compare September 10, 2025 18:42
